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ABSTRACT 

Robust low-rank matrix estimation is a topic of increasing interest, 
with promising applications in a variety of fields, from computer 
vision to data mining and recommender systems. Recent theoreti- 
cal results establish the ability of such data models to recover the 
true underlying low-rank matrix when a large portion of the mea- 
sured matrix is either missing or arbitrarily corrupted. However, if 
low rank is not a hypothesis about the true nature of the data, but 
a device for extracting regularity from it, no current guidelines ex- 
ist for choosing the rank of the estimated matrix. In this work we 
address this problem by means of the Minimum Description Length 
(MDL) principle - a well established information-theoretic approach 
to statistical inference - as a guideline for selecting a model for the 
data at hand. We demonstrate the practical usefulness of our formal 
approach with results for complex background extraction in video 
sequences. 

Index Terms — Low-rank matrix estimation, PCA, Robust 
PCA, MDL. 

1. INTRODUCTION 

The key to success in signal processing applications often depends 
on incorporating the right prior information about the data into the 
processing algorithms. In matrix estimation, low-rank is an all-time 
popular choice, with analysis tools such as Principal Component 
Analysis (PCA) dominating the field. However, PCA estimation is 
known to be non-robust, and developing robust alternatives is an ac- 
tive research field (see [ ] for a review on low-rank matrix estima- 
tion). In this work, we focus on a recent robust variant of PCA, 
coined RPCA [ ], which assumes that the difference between the 
observed matrix Y, and the true underlying data X, is a sparse ma- 
trix E whose non-zero entries are arbitrarily valued. It has been 
shown in [ ] that X (alternatively, E) can be recovered exactly by 
means of a convex optimization problem involving the rank of Y 
and the t\ norm of E. The power of this approach has been re- 
cently demonstrated in a variety of applications, mainly computer vi- 
sion (see [ ] and http : / / perception . csl . uiuc . edu /matrix- rank/ 
applications . html for examples). 

However, when used as a pure data modeling tool, with no as- 
sumed "true" underlying signal, the rank of X in a PCA/RPCA de- 
composition is a parameter to be tuned in order to achieve some 
desired goal. A typical case is model selection [3, Chapter 7], where 
one wants to select the size of the model (in this case, rank of the ap- 
proximation) in order to strike an optimal balance between the ability 
of the estimated model to generalize to new samples, and its ability to 
adapt itself to the observed data (the classic overfitting/underfitting 
trade-off in statistics). The main issue in model selection is how to 
formulate this balance as a cost function. 
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In this work, we address this issue via the Minimum Description 
Length (MDL) principle [ , ]} MDL is a general methodology for 
assessing the ability of statistical models to capture regularity from 
data. The MDL principle can be regarded as a practical implemen- 
tation of the Occam's razor principle, which states that, given two 
descriptions for a given phenomenon, the shorter one is usually the 
best. In a nutshell, MDL equates "ability to capture regularity" with 
"ability to compress" the data, using codelength or compressibility 
as the metric for measuring candidate models. 

The resulting framework provides a robust, parameter-free low- 
rank matrix selection algorithm, capable of capturing relevant low- 
rank information in the data, as in the video sequences from surveil- 
lance cameras in the illustrative application here reported. From a 
theoretical standpoint, this brings a new, information theoretical per- 
spective into the problem of low-rank matrix completion. Another 
important feature of an MDL-based framework such as the one here 
presented is that new prior information can be naturally and easily 
incorporated into the problem, and its effect can be assessed objec- 
tively in terms of the different codelengths obtained. 

2. LOW-RANK MATRIX ESTIMATION/APPROXIMATION 

Under the low-rank assumption, a matrix Y £ ]^ mXn can b e written 
as Y = X + E, where rank(X) < min{ra, n} and ||E|| < || Y||, 
where || • || is some matrix norm. Classic PCA provides the best rank- 
k approximation to Y under the assumption that E is a random ma- 
trix with zero-mean IID Gaussian entries, 

X = argmm||Y- W|| 2 , s.t. rank(W) < k. (1) 

However, PCA is known to be non-robust, meaning that the estimate 
X can vary significantly if only a few coefficients in E are modified. 
This work, providing an example of introducing the MDL frame- 
work in this type of problems, focuses on a robust variant of PCA, 
RPCA, introduced in [ ]. RPCA estimates X via the following con- 
vex optimization problem, 

X = argmin ||Y - W||, + A ||W|| # , (2) 
w 

where ||W||^ := is the nuclear norm of W (a(W)i de- 

notes the i-th singular value of W). The rationale behind (2) is as 
follows. First, the t\ fitting term allows for large errors to occur in 
the approximation. In this sense, it is a robust alternative to the £2 
norm used in PCA. The second term, A || W||^ , is a convex approx- 
imation to the PCA constraint rank(W) < k, merged into the cost 
function via a Lagrange multiplier A. 

This formulation has been recently shown to be notoriously ro- 
bust, in the sense that, if a true low-rank matrix X exists, it can 

1 While here we address the matrix formulation, the developed framework 
is applicable in general, including to sparse models, and such general formu- 
lation will be reported in our extended version of this work. 



be recovered using (2) even when a significant amount of coeffi- 
cients in E are arbitrar ily large [ ]. This can be achieved by setting 
A = 1/ ^maxfm, n}, so that the procedure is parameter-free. 



2.1. Low-rank approximation as dimensionality reduction 

In many applications, the goal of low-rank approximation is not to 
find a "true" underlying matrix X, but to perform what is known 
as "dimensionality reduction," that is, to obtain a succinct represen- 
tation of Y in a lower dimensional subspace. A typical example is 
feature selection for classification. In such cases, E is not necessarily 
a small measurement perturbation, but a systematic, possibly large, 
error derived from the approximation process itself. Thus, RPCA 
arises as an appealing alternative for low-rank approximation. 

However, in the absence of a true underlying signal X (and de- 
viation E), it is not clear how to choose a value of A that produces 
a good approximation of the given data Y for a given application. 
A typical approach would involve some cross-validation step to se- 
lect A to maximize the final results of the application (for example, 
minimize the error rate in a classification problem). 

The issue with cross-validation in this situation is that the best 
model is selected indirectly in terms of the final results, which can 
depend in unexpected ways on later stages in the data processing 
chain of the application (for example, on some post-processing of the 
extracted features). Instead, we propose to select the best low-rank 
approximation by means of a direct measure on the intrinsic abil- 
ity of the resulting model to capture the desired regularity from the 
data, this also providing a better understanding of the actual struc- 
ture of the data. To this end, we use the MDL principle, a general 
information-theoretic framework for model selection which provides 
means to define such a direct measure. 



3. MDL-BASED LOW-RANK MODEL SELECTION 

Consider a family M of candidate models which can be used to 
describe a matrix Y exactly (that is, losslessly) using some encod- 
ing procedure. Denote by L(Y\M) the description length, in bits, 
of Y under the description provided by a given model M G M. 
MDL will then select the model M e M for Y for which L(Y\M) 
is minimal, that is M = argminMeA4 L(Y\M). It is a standard 
practice in MDL to use the ideal Shannon code for translating prob- 
abilities into codelengths. Under this scheme, a sample value y with 
probability P(y) is assigned a code with length L{y) = — log P(y) 
(all logarithms are taken on base 2). This is called an ideal code be- 
cause it only specifies a codelength, not a specific binary code, and 
because the codelengths produced can be fractional. 

By means of the Shannon code assignment, encoding schemes 
L(-) can be defined naturally in terms of probability models P(-). 
Therefore, the art of applying MDL lies in defining appropriate prob- 
ability assignments P(), that exploit as much prior information as 
possible about the data at hand, in order to maximize compressibil- 
ity. In our case, there are two main components to exploit. One is the 
low-rank nature of the approximation X, and the other is that most 
of the entries in E will be small, or even zero (in which case E will 
be sparse). Given a low-rank approximation X of Y, we describe Y 
as the pair (X, E), with E = Y — X. Thus, our family of models is 
given by M — {(X, E) : Y = X + E, rank(X) < rank(Y)}. As 
E = Y — X, we index M solely by X. With these definitions, the 
description codelength of Y is given by L(Y|X) := L(X) + L(E). 
Now, to exploit the low rank of X, we describe it in terms of its 



reduced SVD decomposition, 

x = uev t u e R mxfc , 



EeR fcxfc , VeR fcxr \ (3) 



where k is the rank of X (the zero-eigenvalues and the respective 
left and right eigenvectors are discarded in this description). We now 
have L(X) = L(U) + L(E) + L(V). Clearly, such description will 
be short if rank(X) is significantly smaller than max{m, n}. We 
may also be able to exploit further structure in U, E and V. 

3.1. Encoding E 

The diagonal of E is a non-increasing sequence of k positive val- 
ues. However, no safe assumption can be made about the magnitude 
of such values. For this scenario we propose to use the universal 
prior for integers, a general scheme for encoding arbitrary positive 
integers in an efficient way [ ], 

W) = lo g* 3 '= lo S i + iogiog i + • • • + log 2.865, (4) 

where the sum stops at the first non-positive summand, and log 2.865 
is added to satisfy Kraft's inequality (a requirement for the code to 
be uniquely decodable, see [ , Chapter 5]). In order to apply (4), 
the diagonal of E, diag(E), is mapped to an integer sequence via 
[10 16 diag(E)], where [•] denotes rounding to nearest integer (this is 
equivalent to quantizing diag(E) with precision fe = 10 -16 ). 

3.2. Encoding U and V, general case 

By virtue of the SVD algorithm, the columns of U and V have unit 
norm. Therefore, the most general assumption we can make about 
U and V is that their columns lie on the respective m-dimensional 
and n-dimensional unit spheres. 

An efficient code for this case can be obtained by encoding each 
column of U and V in the following manner. Let be a column of 
U (V is similarly encoded). Since is assumed to be distributed 
uniformly over the m-dimensional unit sphere, the marginal cumula- 
tive density function of the first element uu, F(uu) = P(x < uu), 
corresponds to the proportion of vectors u that lie on the unit spher- 
ical cap of height h = 1 + uu (see Figure 1(a)). This proportion 
is given by F(uu) = A m (l + uu, 1)/S m (l) where A m (h,r) and 
Sm (r) are the area of spherical cap of height h and the total surface 
area of the m-dimensional sphere of radius r respectively. These are 
given for the case < h < r (— 1 < uu < 0) by (see [ ]), 
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t) ~ dt are the regularized incomplete Beta function and the Beta 
function of parameters a, b respectively, and r(-) is the Gamma 
function. When r < h < 2r we simply have A m (h,r) — 1 — 
A m (2r — h,r). For encoding uu we have r = 1 so that 

F(u u ) = (1/2)7(1 - ul- (m - l)/2, 1/2), -1 < u u < 0, (5) 

since 2h - h 2 = h(2 - h) = (1 + uu)[2 - (1 + uu)] = 1- u 2 u . 
Finally, we compute the Shannon codelength for uu as 



p(uu) = F'(uu) 



(a) (1 
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Fig. 1. (a) The spherical cap of radius r and height h (shown in 
gray), (b) Causal bilinear prediction of smooth 2D images. 



where in (a) we applied the Fundamental Theorem of Calculus to 
the definition of F(h) and the chain rule for derivatives. 

With uu encoded, the vector (v,2i, U3i, • • • , u m i) is uniformly 
distributed on the surface of the (m — 1) -dimensional sphere of ra- 
dius r' = 1 — \uu\, and we can apply the same formula to compute 
the probability of u 2i ,F(u 2i ) = A m - 1 (u 2 i + r' \r') / 'S m -i{r'). 

Finally, to encode the next column Ui+i, we can exploit its or- 
thogonality with respect to the previous ones and encode it as a vec- 
tor distributed uniformly over the m — i dimensional sphere cor- 
responding to the intersection of the unit sphere and the subspace 
perpendicular to [ui, . . . , m]. 

In order to produce finite descriptions L(U) and L(V), both 
U and V also need to be quantized. We choose the quantization 
steps for U and V adaptively, using as a starting point the empir ical 
standard deviation of a normalized vector, that is, 5 U — yj^/m and 
5 V — yjl/n respectively, and halving these values until no further 
decrease in the overall codelength L(Y|X) is observed. 

3.3. Encoding U predictively 

If more prior information about U and V is available, it should be 
used as well. For example, in the case of our example application, 
the columns of Y are consecutive frames of a video surveillance 
camera. In this case, the columns of U represent "eigen-frames" 
of the video sequence, while V contains information about the evo- 
lution in time of those frames (this is clearly observed in figures 2 
and 3). Therefore, the columns of U can be assumed to be piecewise 
smooth, just as normal static images are. To exploit this smoothness, 
we apply a predictive coding to the columns of U. Concretely, to 
encode the i-th column u; of U, we reshape it as an image B of 
the same size as the original frames in Y. We then apply a causal 
bilinear predictor to produce an estimate of B, B = {bji} where 
bji = bji - &j(z-i) - &(j-i)z + &(j-i)(z-i), assuming out-of-range 
pixels to be 0. The prediction residual B = B — B is then en- 
coded in raster scan as a sequence of Laplacian random variables 
with unknown parameter 0^ . This encoding procedure, common in 
predictive coding, is depicted in Figure 1(b). 

Since the parameters {6 l u , i = 1, . . . , k} are unknown, we need 
to encode them as well to produce a complete description of Y. In 
MDL, this is done using the so-called universal encoding schemes, 
which can be regarded as a generalization of classical Shannon en- 
coding to the case of distributions with unknown parameters (see [5] 
for a review on the subject). In this work we adopt the so-called 
universal two-part codes, and apply it to encode each column 
separately. Under this scheme, the unknown Laplacian parameter 
for l u is estimated via Maximum Likelihood, 6 % u (u*), and quantized 
with precision 1/ ' \fm, thus requiring L{0 l u ) — \ logra + c\ bits. 



Given the quantized 9 % u , Ui is described using the discretized Lapla- 
cian distribution L(u«) = — log P(ui \6 l u (ui)) + c 2 . Here c\ and c 2 
are constants which can be disregarded for optimization purposes. It 
was shown in [ ] that the precision 1 / yfm asymptotically yields the 
shortest two-parts codelength. 



3.4. Encoding V predictively 

We also expect a significant redundancy in the time dimension, so 
that the columns of V are also smooth functions of time (in this 
case, sample index j = 1, 2, . . . , n). In this case, we apply a first or- 
der causal predictive model to the columns of V, by encoding them 
as sequences of prediction residuals, = (vn,Vi 2 , . . . , Vin), with 
Vij = Vij —Vi(j-i) f° r i > 1 andva = vn. Each predicted column 
Vi is encoded as a sequence of Laplacian random variables with un- 
known parameter 0* . As with U, we use a two-parts code here to 
describe the data and the unknown Laplacian parameters together. 
This time, since the length of the columns is n, the codelength asso- 
ciated to each 01 is L(0 % v ) = | log n. 



3.5. Encoding E 

We exploit the (potential) sparsity of E by first describing the in- 
dexes of its non-zero locations using an efficient universal two-parts 
code for Bernoulli sequences known as Enumerative Code [ ], and 
then the non-zero values at those locations using a Laplacian model. 
In the specific case of the experiments of Section 4, we encode each 
row of E separately. Because each row of E corresponds to the pixel 
values at a fixed location across different frames, we expect some of 
these locations to be better predicted than others (for example, loca- 
tions which are not occluded by people during the sequences), so that 
the variance of the error (hence the Laplacian parameter) will vary 
significantly from row to row. As before, the unknown parameters 
here are dealt with using a two-parts coding scheme. 



3.6. Model selection algorithm 

To obtain the family of models M corresponding to all possible low- 
rank approximations of Y, we apply the RPCA decomposition (2) 
for a decreasing sequence of values of A, {At : t = 1,2,...} ob- 
taining a corresponding sequence of decompositions {(X t , E t ), t = 
1,2,...}. We obtain such sequence efficiently by solving (2) via a 
simple modification of the Augmented Lagrangian-based (ALM) al- 
gorithm proposed in [10] to allow for warm restarts, that is, where 
the initial ALM iterate for computing (Xt,E t ) is (Xt-i,Et-i). 
We then select the pair (X h E f ), i = arg mm t {L(X t ) + L(E t )}. 



4. RESULTS AND CONCLUSION 

In order to have a reference base, we repeated the experiments per- 
formed in [ ] using our algorithm. These experiments consist of 
frames from surveillance cameras which look at a fixed point where 
people pass by. The idea is that, if frames are stacked as columns 
of Y, the background can be well modeled as a low-rank compo- 
nent of Y (X), while the people passing by appear as "spurious er- 
rors" (E). Clearly, if the background in all frames is the same, it 
can be very well modeled as a rank-1 matrix where all the columns 
are equal. However, lighting changes, shadows, and reflections, 
"raise" the rank of the background, and the appropriate rank needed 
to model the background is no longer obvious. 




Fig. 2. Results for the "Lobby" sequence (see text for a description 
of the above pictures and graphs). The rank of the approximation 
decomposition for this case is A; = 10. The moment where the lights 
are turned off is clearly seen here as the "square pulse" in the middle 
of the first two right-eigenvectors (bottom-right figure). Also note 
how U2 (top-right) compensates for changes in shadows. 
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Fig. 3. Results for the "ShoppingMall" sequence (see text for a de- 
scription of the above pictures and graphs). In this case, the rank 
of the approximation decomposition is k = 7. Here, the first left- 
eigenvector models the background, whereas the rest tend to capture 
people that stood still for a while. Here we see the "phantom" of two 
such persons in the second left-eigenvector (top-right). 



Concretely, the experiments here described correspond to two 
sequences: "Lobby" and "ShoppingMall," whose corresponding re- 
sults are summarized respectively in figures 2 and 3. 2 At the top 
of both figures, the first two left-eigenvectors ui and u 2 of X are 
shown as 2D images. The middle shows two sample frames of the 
error approximation. The L-vs-A curve is shown at the bottom-left 
(note that the best A is not the one dictated by the theory in [ ], which 
are 0.007 for Lobby and 0.0035 for ShoppingMall, both outside of 
the plotted range), and the scaled right-eigenvectors cjiWi are shown 
on the bottom-right. In both cases, the resulting decomposition re- 
covered the low-rank structure correctly, including the background, 
its changes in illumination, and the effect of shadows. It can be ap- 
preciated in the figures 2-3 how such approximations are naturally 
obtained as combinations of a few significant eigen- vectors, starting 
with the average background, followed by other details. 

4.1. Conclusion 

In summary, we have presented an MDL-based framework for low- 
rank data approximation, which combines state-of-the-art algorithms 
for robust low-rank decomposition with tools from information the- 
ory. This framework is able to capture the underlying low-rank in- 
formation on the experiments that we performed, out of the box, 
and without any hand parameter tuning, thus constituting a promis- 
ing competitive alternative for automatic data analysis and feature 
extraction. 



2 The full videos can be viewed at http://www.tc.umn.edu/ 
~nacho/lowrank/. 
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